Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: ngen_cal_model_observations model plugin hook and default implementation #155

Merged
merged 6 commits into from
Aug 15, 2024

Conversation

aaraney
Copy link
Member

@aaraney aaraney commented Aug 5, 2024

ngen_cal_model_observations is called during each calibration iteration to provide truth / observation values in the form of a pandas.Series, indexed by time with a record every simulation_interval. The returned pandas.Series should be in units of cubic meters per second.

A default implementation is included in this PR the pulls data from the USGS NWIS IV data service using hydrotools.nwis_client. This is enabled by default and does not require modifying configuration files.

Hook signature:

class ModelHooks:
    @hookspec(firstresult=True)
    def ngen_cal_model_observations(
        self,
        id: str,
        start_time: datetime,
        end_time: datetime,
        simulation_interval: pd.Timedelta,
    ) -> pd.Series:
        """
        `id`: USGS gage station id
        `start_time`, `end_time`: inclusive simulation time range
        `simulation_interval`: time (distance) between simulation values
        """

Users who would like to load observation data from a file, for example, should implement this plugin hook and specify it as a model plugin in your ngen.cal configuration file. For example:

# ngen_cal_observation_plugin.py
from __future__ import annotations

import pandas as pd
from ngen.cal import hookimpl

import typing
if typing.TYPE_CHECKING:
    from datetime import datetime

class NgenCalObservationPlugin:
    @hookimpl(trylast=True)
    def ngen_cal_model_observations(
        self,
        id: str,
        start_time: datetime,
        end_time: datetime,
        simulation_interval: pd.Timedelta,
    ) -> pd.Series:
        df = pd.read_csv("local_observations.csv")

        assert id in df["sites"].values, f"{id} not in {df['sites'].values}"
        df = df[(df["sites"] == id) && (df["time"] <= start_time ]

        df = df[(df["time"] >= start_time) && (df["time"] <= end_time)]

        df.set_index("time", inplace=True)

        ds = df["value"].resample(simulation_interval).nearest()
        return ds

Likewise, if you would like to, instead, save the observations used during calibration, you can implement this plugin as a "wrapper" style plugin. For example:

# ngen_cal_observation_writer_plugin.py
from __future__ import annotations

import pandas as pd
from ngen.cal import hookimpl

import typing
if typing.TYPE_CHECKING:
    from datetime import datetime

class NgenCalObservationWriterPlugin:
    @hookimpl(wrapper=True)
    def ngen_cal_model_observations(
        self,
        id: str,
        start_time: datetime,
        end_time: datetime,
        simulation_interval: pd.Timedelta,
    ) -> pd.Series | None:
        ds = yield
        if ds is None:
            return
        ds.to_csv(f"ngen_cal_observations-{id}.csv")
        return ds

Related to #93
Related to #111

Additions

  • ngen_cal_model_observations - called during each calibration iteration to provide truth / observation values in the form of a pandas.Series, indexed by time with a record every simulation_interval.

simulation_interval: pd.Timedelta = pd.Timedelta(3600, unit="s")
obs = self._hooks.ngen_cal_model_observations(
id=location.station_id,
start_time=start_time,
Copy link
Member Author

@aaraney aaraney Aug 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In reality I think the caller should pass start_time + 1 dt (e.g. 3600s) since the model outputs will not contain values for the actual start_time (left exclusive). I don't think this is a problem now, but just wanted to note it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does matter and things will break if we don't properly account for this. This only works right now b.c. csv_output's dt is 300s which means the first value is start_time + 5min. When we resample the simulation output to the hour using .resample("h").first() the simulated value at 5min is backfilled to start_time and lines up with the nwis observations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This likely heavily depends on how the data is "merged", in search.py _objective_function. Currently this done with a pandas merge with left_index=True, right_index=True which should result in a data frame of the overlapping indicies.

Copy link
Member

@hellkite500 hellkite500 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple things to consider, we can chat about this directly.

The returned pandas Series should be in units of cubic meters per
second.

`id`: USGS gage station id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

id doesn't strictly have to be USGS gage id, but more generally some identifier linked/referenced to the hydrofabric...we just happen to use gage id's for now. Perhaps this should ultimately be the Hydrofabric feature id instead of the obs id? Or we may need a feature id and a linked id?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we pass a Nexus feature here which should be able to embody additional context extracted from the hydrofabric which can be used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great questions! Glad to hear that we are thinking on the same page. Two thoughts preface with: I went with this design b.c. I decided to just keep it as simple as possible for now. Since we are pre-1.0, I figured we might change this as we learn things along the way.

  1. I'd originally written up a complex object that encapsulates what Mike's group refers to a point-of-interest. Like you are saying, the thinking being that the id for the model data is likely not the same id for the observation (truth) data. I decided against that to err on the side of simplicity for now. I am open changing that now or in the future.
  2. I also though about passing a Nexus, but thinking back to previous conversations we'd had about keeping things model agnostic, it seemed passing a NextGen specific idea went against that. If we ever do go in the direction of something like a POI interface as a solution to this problem, I figured we could always create a type that implements the POI interface for a Nexus for example.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conversation with @hellkite500, in theory a HydroLocation or Nexus makes the most sense here (really a Nexus b.c. its more general). We have decided to go with Nexus for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the Nexus is the right way to go for now, it encapsulates a (eventually) list of hydrolocation which can reference different entities.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related: NOAA-OWP/hypy#37

python/ngen_cal/src/ngen/cal/ngen_hooks/observations.py Outdated Show resolved Hide resolved
@aaraney aaraney force-pushed the obs-hook branch 2 times, most recently from 2169ed2 to 2d13584 Compare August 15, 2024 18:11
Copy link
Member

@hellkite500 hellkite500 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a follow up PR, we should create a "user" hook which is not in core which reads a typical USGS csv/tabular file and create a quick example of how to use that instead of the default hook via config

@hellkite500 hellkite500 merged commit 22b2a71 into NOAA-OWP:master Aug 15, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request ngen.cal Related to ngen.cal package
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants